Combating Web Spam with TrustRank
Abstract
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine’s results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
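The propagation step described in the abstract can be sketched as a PageRank-style iteration whose random jump is restricted to the expert-approved seeds. This is a minimal sketch and not the authors' implementation: the function name `trust_rank`, the damping value 0.85, the fixed iteration count, and the toy graph are all assumptions made for illustration, and seed selection is taken as already done by hand.

```python
def trust_rank(out_links, good_seeds, damping=0.85, iterations=20):
    """Propagate trust from hand-vetted seed pages along outgoing links.

    out_links  -- dict mapping each page to a list of pages it links to
    good_seeds -- set of pages an expert has judged reputable
    damping and iterations are illustrative defaults (assumptions).
    """
    # Collect every page that appears as a source or a target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Seed distribution: uniform over good seeds, zero everywhere else.
    seed = {p: (1.0 / len(good_seeds) if p in good_seeds else 0.0)
            for p in pages}

    trust = dict(seed)
    for _ in range(iterations):
        # Each round restarts from the seed distribution ...
        nxt = {p: (1.0 - damping) * seed[p] for p in pages}
        # ... and every page splits its current trust evenly
        # among the pages it links to.
        for page, targets in out_links.items():
            if not targets:
                continue
            share = damping * trust[page] / len(targets)
            for target in targets:
                nxt[target] += share
        trust = nxt
    return trust


# Hypothetical toy graph: a chain of good pages and an isolated spam pair.
graph = {
    "a": ["b"],
    "b": ["c"],
    "c": [],
    "s1": ["s2"],
    "s2": ["s1"],
}
scores = trust_rank(graph, good_seeds={"a"})
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the spam pair receives no trust because no good page links to it, while trust decays along the chain from the seed. Real spam pages that manage to attract links from good pages would receive some trust, which is why the choice of seeds matters.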
Similar resources
Fast Asynchronous Anti-TrustRank for Web Spam Detection
Web spam detection is an important problem in Web search. Since Web spam pages tend to have a lot of spurious links, many Web spam detection algorithms exploit the hyperlink structure between Web pages to detect the spam pages. The Anti-TrustRank algorithm is a well-known link-based spam detection algorithm which follows the principle that spam pages are likely to be referenced by other spam pa...
Link-Based Characterization and Detection of Web Spam
We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. ...
Link-Based Spam Algorithms in Adversarial Information Retrieval
Web spam has become one of the most exciting challenges and threats to Web search engines. The relationship between search systems and those who try to manipulate them gave rise to the field of adversarial information retrieval. In this paper, we have set up several experiments to compare HostRank and TrustRank to show how effective it is for TrustRank to combat Web spam and we have also re...
Propagating Trust and Distrust to Demote Web Spam
Web spamming describes behavior that attempts to deceive a search engine's ranking algorithms. TrustRank is a recent algorithm that can combat web spam by propagating trust among web pages. However, TrustRank propagates trust among web pages based on the number of outgoing links, which is also how PageRank propagates authority scores among Web pages. This type of propagation may be suited for pro...
SIGIR 2006 Workshop on Adversarial Information Retrieval on the Web AIRWeb 2006
We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. ...